Final Project

Analyzing Taiwan’s AQI

Mitchell Levy

Professor Frank

Data Visualization

12-12-2024

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(Stat2Data)
library(maps)

Attaching package: 'maps'

The following object is masked from 'package:purrr':

    map
library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(RColorBrewer)
library(dplyr)
library(gganimate)
library(gifski)
library(gridExtra)

Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine
air <- read_csv("taiwan_air_quality_mod.csv")
New names:
• `` -> `...1`
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 5882208 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (12): sitename, county, pollutant, co, o3, o3_8hr, pm10, pm2.5, windspe...
dbl  (11): ...1, aqi, so2, no2, nox, no, pm2.5_avg, so2_avg, longitude, lati...
lgl   (1): unit
dttm  (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
air

Background Research

Variable Abbreviations:

  • AQI: Air Quality Index, the higher the worse it is.
  • pollutant: The type of pollutant it is.
  • so2: Sulfur Dioxide in ppb
  • co: Carbon Monoxide in ppm
  • o3: Ozone in ppb
  • o3_8hr 8-hour average of Ozone
  • pm10: Particulate matter under 10μm
  • pm2.5: Particulate matter under 2.5μm
  • no2: Nitrogen Dioxide in ppb
  • nox: Nitrogen Oxides in ppb
  • no: Nitric Oxide in ppb
  • co_8hr: 8-hour average of CO
  • pm2.5_avg: Moving average of PM2.5
  • pm10_avg: Moving average of PM10
  • so2_avg: Moving average of SO2

PM2.5: Fine particulate matter with a diameter of less than 2.5 micrometers. These particles are capable of penetrating deep into the lungs and bloodstream, causing respiratory and cardiovascular issues.

PM10: Particulate matter with a diameter less than 10 micrometers. These particles can cause respiratory irritation but are not as invasive as PM2.5.

SO2: Sulfur Dioxide is a toxic gas produced primarily by fossil fuel combustion, which can lead to respiratory problems.

NOx: A group of nitrogen oxides (including NO2 and NO) generated by combustion engines, contributing to smog and acid rain.

CO: Carbon Monoxide is a colorless, odorless gas that can be harmful when inhaled in large amounts. It primarily originates from vehicle emissions and other combustion sources.

O3: Ozone is a gas that forms in the atmosphere and is harmful at ground level. It’s a major component of smog and can cause respiratory issues.

Source for this information: https://www.kaggle.com/datasets/taweilo/taiwan-air-quality-data-20162024

This data set is about Taiwan’s air quality, or AQI. Having a low or high AQI directly translates into better or worse health for the people living there; it is a crucial variable for determining how healthy a location is. I am going to ask 6 questions, all relating to Taiwan’s AQI. Hopefully, this “report” will help showcase the factors that determine air quality so we can understand what causes good or poor AQI scores.

Question 1. Which type of main pollutant is most prevalant in the counties, and what are their AQI’s?

First, we need to clean some of the data so that we can make a visualization.

To start, we have to get the date in a better format.

air |>
  mutate(Date = as.Date(date, format = "%d/%m/%y")) |>
  mutate(Year = format(Date, "%Y")) |>
  mutate(Month = format(Date, "%m")) |>
   mutate(Day = format(Date, "%d"))-> air.date
air.date
air.date |>
  filter(Year == 2023) |>
  filter(Month == "06" | Month == "07" | Month == "08") |>
  drop_na(county, pollutant, aqi) |>
  group_by(pollutant) |>
  summarise(count = n(), aqi.bar = mean(aqi)) -> air.pollutant
air.pollutant
air.pollutant |> 
  ggplot(aes(x = pollutant, y = count, fill = aqi.bar)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "deepskyblue", high = "darkblue") +
  labs(title = "Number of Each Pollutant in the Summer of 2023",
       x = "Pollutant",
       y = "Count",
       fill = "Average AQI") +
  theme_minimal() 

This graph shows the number of times a type of pollutant was the main pollutant in the summer of 2023. It also shows the average AQI through the color scale on the right side of the graph. The darker the color, the worse the AQI is when that pollutant is the most common. Interestingly, PM2.5 dominates as the most common pollutant by far. However, Ozone seems to cause the worst AQI, as you can see by the darkest color of the bar. Finally S02 is rarely the AQI, and is barely visible on the graph. Using this information, it would appear that Taiwan should primarirly focus on reducing the amount of PM2.5 and Ozone particles in order to improve their air quality.

2. Has Taiwan’s air quality improved over time?

air.date |>
  mutate(windspeed = as.double(windspeed), Year = as.numeric(Year), Month = as.numeric(Month)) |>
  drop_na(aqi, windspeed, Date) |>
  group_by(Year, Month) |>
  summarise(aqi.bar = mean(aqi), windspeed.bar = mean(windspeed)) -> air.time
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `windspeed = as.double(windspeed)`.
Caused by warning:
! NAs introduced by coercion
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
air.time
air.time |>
  drop_na(aqi.bar, windspeed.bar) |>
  mutate(Month = factor(Month, levels = 1:12)) |>  # Convert month to factor
  ggplot(aes(Month, aqi.bar, group = Year)) +  # Group by Year because each graph is one year
  geom_line(color = "purple") +
  theme_classic() +
  ggtitle("Taiwan's AQI Over Time") +
  labs(x = "Month", y = "AQI") +
  facet_wrap(~Year) +
  scale_x_discrete(labels = month.name) +  # Replace factor levels with month names
  scale_y_continuous(expand = c(0, NA)) +   
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

This graph shows the average AQI for each month over a time period of several years. Note, there is not much data in 2016, but there are still 2 months of data that can be shown. The peak of this data set appears to have happened in 2017, when the AQI was over 80. The lowest the AQI ever hits is below 30 in 2022. Overall, looking at all of the years in this data set, it does appear that the air quality is improving, albeit slowly. However, the difference is so small, that it may not be statistically significant. Most interestingly, June, July, and August always have the lowest AQI. These are the summer months in Taiwan, so I would guess that warmer temperatures might lead to lower AQI. Unfortunately, temperature is not a variable in this data set, so I have no direct way to verify this.

air.time |>
  drop_na(aqi.bar, windspeed.bar) |>
  mutate(Month = factor(Month, levels = 1:12)) |>  # Convert month to factor
  ggplot(aes(x = Month, y = aqi.bar, group = Year, color = as.factor(Year))) +  # Group and color by Year
  geom_line() +
  theme_classic() +
  ggtitle("Taiwan's AQI Over Time") +
  labs(x = "Month", y = "AQI", color = "Year") +  # Add legend for Year
  scale_x_discrete(labels = month.name) +  # Replace factor levels with month names
  scale_y_continuous(expand = c(0, NA)) +   
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This graph shows the same data as the previous graph, but in a different format. Here, it is much easier to see that all of the years have the same aqi trends. The summer months have better air quality than colder months do.

Let’s make one more graph using this data, this time a heat map.

air.time |>
  drop_na(aqi.bar, windspeed.bar) |>
  mutate(Month = factor(Month, levels = 1:12), Year = factor(Year)  
  ) |>
  ggplot(aes(x = Month, y = Year, fill = aqi.bar)) +  
  geom_tile() +
  theme_classic() +
  ggtitle("Taiwan's AQI Over Time") +
  labs(x = "Month", y = "Year", fill = "AQI") +  # Add legend for AQI
  scale_x_discrete(labels = month.name) +  # Replace factor levels with month names
  scale_fill_gradient(low = "blue", high = "orange") +  # Color gradient for AQI
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels

Ok, this is just another cool way to showcase the same data, this time as a heat map. The center band is all blue. This is the maps’s representation of the dip in the aqi levels for the summer months. You can also see that the winter months here have worse AQI than any other. I am guessing that temperature might have to do with aqi, but that variable is not in the data set. Anyway, this might be the best way to look at this data as it is colorful, and easy to read, but still very useful!

3. Is there any relationship between N0X (Nitrogen Oxide) and CO (Carbon Monoxide)? If so, what does the linear model say about their relationship?

For this question, I am going to look at the past two years in the least analyzed counties.

air.date |> filter(Year == 2024 | Year == 2023) |>
  drop_na(nox, co) |>
  group_by(county) |>
  summarise(count = n()) |>
  arrange(desc(count))

Now that we know the four most common neighborhoods in this data set, let’s clean up the data more.

air.date |>
  filter(Year == 2023 & Month == 12) 

And let’s do more cleaning and take a sample.

air.date |>
  drop_na(nox, co) |>
  filter(county == "Nantou County" | county == "Yunlin County" | county == "Changhua County" | county == "Pingtung County")|>
  filter(Year == 2023) |>
  filter(Month == 12) |>
  mutate(nox = as.double(nox), co = as.double(co)) |>
  sample_n(750) -> air.nox
air.nox

Now we can graph it!

air.nox |>
  ggplot(aes(co, nox)) +
  geom_point(color = "deepskyblue") +
  facet_wrap(~county) +
  geom_smooth(method = "lm", se = FALSE)+
  scale_color_manual(values = "deepskyblue") +
  ggtitle("December 2023 Carbon Monoxide vs Nitrogon Oxide") +
  labs(x = "Carbon Monoxide, ppb", y = "Nitrogen Oxide, ppm") +
  theme(legend.position = "none") +
  scale_x_continuous(
    breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2)  # Custom tick positions
  ) +
  theme_bw()
`geom_smooth()` using formula = 'y ~ x'

Linear Model for all of the data sets:

lm(nox ~ co, data = air.nox) -> nox.modelData # order is y, then x
nox.modelData

Call:
lm(formula = nox ~ co, data = air.nox)

Coefficients:
(Intercept)           co  
     -2.055       41.064  

Now I am going to make a model for each county.

air.nox |>
  filter(county == "Nantou County") -> air.nan
lm(nox ~ co, data = air.nan) -> nan.modelData # order is y, then x
nan.modelData

Call:
lm(formula = nox ~ co, data = air.nan)

Coefficients:
(Intercept)           co  
     0.5758      32.8481  
air.nox |>
  filter(county == "Changhua County") -> air.chang
lm(nox ~ co, data = air.chang) -> chang.modelData # order is y, then x
chang.modelData

Call:
lm(formula = nox ~ co, data = air.chang)

Coefficients:
(Intercept)           co  
   -0.07843     38.19917  
air.nox |>
  filter(county == "Pingtung County") -> air.ping
lm(nox ~ co, data = air.ping) -> ping.modelData # order is y, then x
ping.modelData

Call:
lm(formula = nox ~ co, data = air.ping)

Coefficients:
(Intercept)           co  
     -4.147       44.999  
air.nox |>
  filter(county == "Yunlin County") -> air.yun
lm(nox ~ co, data = air.yun) -> yun.modelData # order is y, then x
yun.modelData

Call:
lm(formula = nox ~ co, data = air.yun)

Coefficients:
(Intercept)           co  
     -2.522       42.314  

This figure shows the relationship between Carbon Monoxide and Nitrogen Oxide. Specifically, this graph is for December 2023, for 4 counties. All 4 counties have roughly the same outcome, a decently strong correlation between the two variables. Looking at the graph, you can see that as CO increases, so does NOX. This makes sense, as both pollutants come from combustion engines. This means that the more cars are driving around, the more NOX and CO there will be. Note, that this correlation does NOT mean that these two variables are causing each other. Rather, they are simply correlated and both tend to increase or decrease at the same time, probably depending on how many cars are driving on the road. Finally, I made a linear model for all 4 counties, and 4 models for each individual counties, using CO as a predictor variable for NOX. The y-intercept is -2.368, which is the model’s prediction of NOX if the amount of CO is 0. The slope of the line is 41.279. This refers to how steep the linear model is. More explicitly, for ever one unit of CO, NOX increases by about 41 units, according to the model. The other models slopes are about 41, 28, 43, and 39. Besides Nantou’s counties 28, all of the counties have roughly the same slope.

4. Does Northern or Southern Taiwan have worse AQI? What about East and West?

Let’s graph Taiwan!

First, let’s get the data that we want on the graph, the average AQI and the average windspeed for each county.

air.date |>
  filter(Year == 2023) |>
  filter(Month == "12") |>
  drop_na(aqi, county, windspeed) |>
  group_by(county) |>
  mutate(
    # Removing "City" and "County" suffixes to match List 2
    county = gsub(" (City|County)$", "", county),
    
    # Renaming county column to match List 2
    county = case_when(
      county == "Lienchiang" ~ "Lienkiang",    # Rename 'Lienchiang' to 'Lienkiang'
      county == "Chiayi" ~ "Chiayi County",      # Rename 'Chiayi' to 'Chiayi City'
      county == "Hsinchu" ~ "Hsinchu County",    # Rename 'Hsinchu' to 'Hsinchu City'
      county == "Yunlin" ~ "Yulin",            # Rename 'Yunlin' to 'Yulin'
      TRUE ~ county  # Keep the rest unchanged
    )
  ) |>
  rename(NAME_2 = county) |>
  summarise(aqi.bar = mean(aqi), windspeed.bar = mean(windspeed)) -> air.map
Warning: There were 20 warnings in `summarise()`.
The first warning was:
ℹ In argument: `windspeed.bar = mean(windspeed)`.
ℹ In group 1: `NAME_2 = "Changhua"`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
ℹ Run `dplyr::last_dplyr_warnings()` to see the 19 remaining warnings.
air.map

Let’s look at our file.

taiwan <- st_layers("gadm41_TWN.gpkg")
taiwan

We need to compare the county names in our data set and the file we downloaded.

unique(air.map$NAME_2)
 [1] "Changhua"       "Chiayi County"  "Hsinchu County" "Hualien"       
 [5] "Kaohsiung"      "Keelung"        "Kinmen"         "Lienkiang"     
 [9] "Miaoli"         "Nantou"         "New Taipei"     "Penghu"        
[13] "Pingtung"       "Taichung"       "Tainan"         "Taipei"        
[17] "Taitung"        "Taoyuan"        "Yilan"          "Yulin"         
taiwan <- st_read("gadm41_TWN.gpkg", layer = "ADM_ADM_2")
Reading layer `ADM_ADM_2' from data source 
  `/Users/mitchlevy/Marist/Fall 2024/Data Visualization/gadm41_TWN.gpkg' 
  using driver `GPKG'
Simple feature collection with 22 features and 13 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 116.71 ymin: 20.6975 xmax: 122.1085 ymax: 26.38542
Geodetic CRS:  WGS 84
taiwan
unique(taiwan$NAME_2)
 [1] "Kinmen"         "Lienkiang"      "Kaohsiung"      "New Taipei"    
 [5] "Taichung"       "Tainan"         "Taipei"         "Changhua"      
 [9] "Chiayi City"    "Chiayi County"  "Hsinchu City"   "Hsinchu County"
[13] "Hualien"        "Keelung"        "Miaoli"         "Nantou"        
[17] "Penghu"         "Pingtung"       "Taitung"        "Taoyuan"       
[21] "Yilan"          "Yulin"         

Now we can combine the two data sets!

taiwan.new <- left_join(taiwan, air.map, by = "NAME_2")
taiwan.new

Finally, we can graph them.

taiwan.new |>
  ggplot(aes(fill = aqi.bar)) +
  geom_sf(show.legend = TRUE) +
  coord_sf(xlim = c(118, 122.1), ylim = c(21.7, 26.5)) +
  ggtitle("Map of Taiwan's AQI") +
  scale_fill_gradientn(colors = brewer.pal(7, "YlOrRd")) +
  labs(fill = "Average AQI")

As you can see, the further south that you go, the worse the AQI seems to get, except along the East coast. After doing a small amount of research, Taiwan has a huge mountain range called Chungyang/Central Mountain Range that separates the East and West of this country. Because of this mountain range, the population is mainly on the West coast. This explains why the average AQI is so much higher on the West, there are more people living there.

5. In 2023, in New Taipei City, is there a correlation between the PM2.5 and AQI?

First we need to get the data ready.

air.date |>
  filter(county == "New Taipei City" & Year == 2023 & aqi <= 100) |>
  drop_na(pm2.5, Month, aqi) |>
  mutate(pm2.5 = as.numeric(pm2.5)) |>
  group_by(Month) |>
  sample_n(200) -> air.wind
air.wind

Now we can graph it with an animation!

air.wind |>
  ggplot(aes(x = pm2.5, y = aqi)) +
  geom_point(color = "orange", size = 3) +
  geom_smooth(method = "lm") +
  transition_time(as.numeric(Month)) +  # Ensure Month is numeric
  labs(
    title = "Month: {month.name[frame_time]}",  # Use month name as text 
    x = "PM2.5", y = "AQI") +
  theme_minimal() -> air.animation
animate(air.animation, enderer = gifski_renderer())
`geom_smooth()` using formula = 'y ~ x'

6. Since we know the Southern area of Taiwan has worse AQI, is there any correlation between the latitude/longitude and windspeed?

First, we have to clean the data!

air.date |>
  filter(Year == 2023 & Month == "07") |>
  filter(windspeed < 3) |>
  drop_na(windspeed, aqi, latitude) |>
  mutate(windspeed = as.numeric(windspeed)) |>
  group_by(county) |>
  summarise(aqi.bar = mean(aqi), latitude = mean(latitude), longitude = mean(longitude), windspeed.bar =        mean(windspeed)) |>
  drop_na(windspeed.bar)-> air.final
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `windspeed = as.numeric(windspeed)`.
Caused by warning:
! NAs introduced by coercion
air.final

Now, let’s graph it!

ggplot(air.final, aes(x = latitude, y = windspeed.bar, group = 1)) +
  geom_smooth(aes(latitude, windspeed.bar), method = "lm") +
  geom_point(aes(size = aqi.bar), alpha = 0.5, color = "goldenrod1") +  
  scale_size_continuous(name = "AQI Value", range = c(3, 15)) +  
  labs(title = "Latitude vs. Windspeed in New Taipei City, 2023", x = "Latitude", y = "Average Windspeed") +
  theme_bw()  
`geom_smooth()` using formula = 'y ~ x'

ggplot(air.final, aes(x = longitude, y = windspeed.bar, group = 1)) +
  geom_smooth(aes(longitude, windspeed.bar), method = "lm") +
  geom_point(aes(size = aqi.bar), alpha = 0.5, color = "goldenrod1") +  
  scale_size_continuous(name = "AQI Value", range = c(3, 15)) +  
  labs(title = "Longitude vs. Windspeed in New Taipei City, 2023", x = "Longitude", y = "Average Windspeed") +
  theme_bw()  
`geom_smooth()` using formula = 'y ~ x'

These graphs shows how latitude, longitude, wind speed, and aqi are all related. Interestingly, it appears that there is a slight trend between latitude and wind speed. As latitude increases, so does the wind speed, which is a direct relationship. There is also a trend between longitude and wind speed. As longitude increases, the wind speed decreases. This is an inverse relationship. However, it appears that the aqi does not have much variance on either of the two graphs. So, according to these figures, longitude and latitude does not affect aqi, even though it does affect wind speed. However, according to external research, aqi is affected by wind speed, so I wonder why that is not the case for this specific data set.

#7. In general, how was Lienchiang’s air quality in the winter of 2023?

First, let’s clean the data.

air.date |>
  filter(county == "Lienchiang County" & Year == 2023) |>
  filter(Month %in% c("12", "01", "02", "03")) |>
  drop_na(Month, aqi) |>
  mutate(
    Month = case_when(
      Month == "01" ~ "January",
      Month == "02" ~ "February",
      Month == "03" ~ "March",
      Month == "12" ~ "December",
      TRUE ~ Month  # Keep as is for unexpected values
    ),
    aqiCat = factor(case_when(
      aqi <= 50 ~ "Good",
      aqi > 50 & aqi <= 100 ~ "Moderate",
      aqi > 100 & aqi <= 150 ~ "Unhealthy for Sensitive Groups",
      aqi > 150 & aqi <= 200 ~ "Unhealthy",
      aqi > 200 & aqi <= 300 ~ "Very Unhealthy",
      aqi > 300 ~ "Hazardous",
      TRUE ~ NA_character_  # Default case for missing or unexpected values
    ), levels = c("Good", "Moderate", "Unhealthy for Sensitive Groups", "Unhealthy", "Very Unhealthy", "Hazardous"))
  ) |>
  group_by(Month, aqiCat) |>
  summarise(count = n(), .groups = "drop") |>
  complete(Month, aqiCat, fill = list(count = 0)) -> air.levels
air.levels
air.levels |>
  ggplot(aes(aqiCat, count, fill = aqiCat)) +  # Map aqiCat to fill
  geom_col() + 
  facet_wrap(~Month) +
  ggtitle("Lienchiang's AQI in the winter of 2023") +
  labs(x = "Air Quality Categories", y = "Count") +
  scale_fill_manual(values = c("green", "yellow", "orange", "red", "purple", "maroon")) +  # Use custom colors
  theme(axis.text.x = element_text(angle = 20, hjust = 1),
        legend.position = "none"  # Optionally hide the legend if redundant
  )

Here is the graph of Lienchiang’s aqi in the winter of 2023. I chose these colors because they match the official aqi level colors that I found online. Looking at this graph, one can see that the aqi levels are mostly in the moderate category. This is somewhat surprising, I expected these values to be much worse. One positive thing is that this winter never had any very unhealthy or hazardous readings. Also, almost every month in the figure has the same story where the moderate category dominates, while the good category is second, except for February. In that month, most of the aqi readings were actually good! Overall, some very interesting findings in this data set.